Before we get started, we need to set a few things up. GitHub is a platform for software development and version control using Git, allowing developers to store and manage their code. Think of it as Google Docs but for code: it will be very useful for collaborating on your group projects later in the term, and in your future as a data analyst.
Voila! You now have a version of this notebook saved to your own GitHub account. You will need to do step 3 for all the workshops! Now, on to Python.
In this course, we’ll make extensive use of Python, a programming language used widely in scientific computing and on the web. We will be using Python as a way to manipulate, plot and analyse data. This isn’t a course about learning Python; it’s about working with data - but we’ll learn a little bit of programming along the way.
By now, you should have done the prerequisites for the module, and understand a bit about how Python is structured, what different commands do, and so on - this is a bit of a refresher to remind you of what we need at the beginning of term.
The particular flavour of Python we’re using is IPython, which, as we’ve seen, allows us to combine text, code, images, equations and figures in a Notebook. This is a cell, written in Markdown - a way of writing nicely formatted text. Contrast this with a code cell, which executes a bit of Python:
print(2+2)
4
The Notebook format allows you to engage in what Don Knuth describes as Literate Programming:
[…] Instead of writing code containing documentation, the literate programmer writes documentation containing code. No longer does the English commentary injected into a program have to be hidden in comment delimiters at the top of the file, or under procedure headings, or at the end of lines. Instead, it is wrenched into the daylight and made the main focus. The “program” then becomes primarily a document directed at humans, with the code being herded between “code delimiters” from where it can be extracted and shuffled out sideways to the language system by literate programming tools.
- Ross Williams
We will work with a number of libraries, which provide additional functions and techniques to help us to carry out our tasks.
These include:
Pandas: we’ll use this a lot to slice and dice data
matplotlib: this is our basic graphing software, and we’ll also use it for mapping
nltk: The Natural Language Tool Kit will help us work with text
We aren’t doing all this to learn to program. We could spend a whole term learning how to use Python and never look at any data, maps, graphs, or visualisations. But we do need to understand a few basics to use Python for working with data. So let’s revisit a few concepts that you should have covered in your prerequisites.
Python can broadly be divided into verbs and nouns: things which do things, and things which are things. In Python, the verbs can be commands, functions, or methods. We won’t worry too much about the distinction here - suffice it to say, they are the parts of code which manipulate data, calculate values, or show things on the screen.
The simplest proper noun object in Python is the variable. Variables are given names and store information. This can be, for example, numeric, text, or boolean (true/false). These are all statements setting up variables:
n = 1
t = "hi"
b = True
Now let’s try this in code:
n = 1
t = "hi"
b = True
Note that each command is on a new line; other than that, the syntax of Python should be fairly clear. We’re setting these variables equal to the letters and numbers and phrases and booleans. What’s a boolean?
The value of this is that we now have values tied to these variables - so every time we want to use a value, we can refer to its variable:
n
1
t
'hi'
b
True
Because we’ve defined these variables in the early part of the notebook, we can use them later on.
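If you are not sure what kind of value a variable holds, Python's built-in type() function can tell you. The sketch below re-creates the n, t and b variables from above so that it runs on its own, and adds a made-up variable, is_big, to show that a comparison produces a boolean (a value that is either True or False).

```python
# Re-create the three variables from earlier in the notebook
n = 1
t = "hi"
b = True

# type() reports what kind of value each variable currently holds
for value in (n, t, b):
    print(type(value).__name__, value)

# Comparisons produce booleans; is_big is a made-up name for illustration
is_big = n > 5
print(is_big)
```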
Advanced: where do classes fit into this noun/verb picture of variables and commands?
When we work in Excel and text editors, we’re used to seeing the data onscreen - and if we manipulate the data in some way (averaging or summing up), we see both the inputs and outputs on screen. The big difference in working with Python is that we don’t see our variables all of the time, or the effect we’re having on them. They’re there in the background, but it’s usually worth checking in on them from time to time, to see whether our processes are doing what we think they’re doing.
This is pretty easy to do - we can just type the variable name, or “print(variable name)”:
n = n+1
print(n)
print(t)
print(b)
2
hi
True
Python, in common with all programming languages, executes commands in a sequence - we might refer to this as the “ineluctable march of the machines”, but it’s more commonly referred to as the flow of the code (we’ll use the word “code” a lot - it just means commands written in the programming language). In most cases, code just executes in the order it’s written. This is true within each cell (each block of text in the notebook), and it’s true when we execute the cells in order; that’s why we can refer back to the variables we defined earlier:
print(n)
2
If we make a change to one of these variables, say n:
n = 3
and execute the above print(n) command again, you’ll see that n is now 3. So if we run cells out of order, the flow of the code becomes hard to follow. For this reason, try to write your code so it executes in order, one cell at a time. At least for the moment, this will make it easier to follow the logic of what you’re doing to data.
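Here is a minimal sketch of the same idea inside a single cell. It uses a made-up variable called counter so that it does not disturb n. Each print() shows whatever value the variable holds at that point in the sequence, not the value it ends up with.

```python
counter = 1
print(counter)         # the value at this point in the flow

counter = counter + 1  # uses the old value to work out the new one
print(counter)

counter = 10           # overwrites whatever was stored before
print(counter)
```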
Advanced: what happens to this flow when you write functions to automate common tasks?
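As a hint for the question above, here is a minimal sketch (the function name add_one is made up for illustration). Defining a function does not run its body. The body only runs when the function is called, at which point the flow jumps into the function and then returns to the line after the call.

```python
def add_one(number):
    # This line runs only when add_one() is called, not when it is defined
    print("inside add_one")
    return number + 1


print("before the call")
result = add_one(2)
print("after the call, result is", result)
```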
Exercise - Setting up variables:
Create a new cell.
Create a variable “name” and assign your name to it.
Create a variable “Python” and assign a score out of 10 to how much you like Python.
Create a variable “prior”; if you’ve used Python before, assign True to it, otherwise assign False.
Print these out to the screen.
Let’s fetch the data we will be using for this session. There are two ways in which you can upload data to the Colab notebook. You can use the following code to upload a CSV or similar data file.
from google.colab import files
uploaded = files.upload()
Or you can use the following cell to fetch the data directly from the QM2 server.
Let’s create a folder in which we can store all our data for this session:
!mkdir data
!mkdir ./data/wk1
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/data.csv -o ./data/wk1/data.csv
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/sample_group.csv -o ./data/wk1/sample_group.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 203 100 203 0 0 2872 0 --:--:-- --:--:-- --:--:-- 3029
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 297 100 297 0 0 1844 0 --:--:-- --:--:-- --:--:-- 1879
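The shell commands above can also be written in plain Python, which may be handy outside Colab. This is only a sketch using the standard library, and the helper name fetch is made up for illustration. Unlike !mkdir, os.makedirs with exist_ok=True does not complain if the folder already exists, so the cell can be re-run safely.

```python
import os
import urllib.request


def fetch(url, destination):
    """Download url to destination, creating the folder first if needed."""
    folder = os.path.dirname(destination)
    if folder:
        # exist_ok=True means no error is raised if the folder is already there
        os.makedirs(folder, exist_ok=True)
    urllib.request.urlretrieve(url, destination)
    return destination
```

For example, calling fetch('https://s3.eu-west-2.amazonaws.com/qm2/wk1/data.csv', './data/wk1/data.csv') should save the first file to the same place as the curl command above.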
Typically, data we look at won’t be just one number, or one bit of text. Python has a lot of different ways of dealing with a bunch of numbers: for example, a list of values is called a list:
listy = [1,2,3,6,9]
print(listy)
[1, 2, 3, 6, 9]
A set of values linked to an index (or key) is called a dictionary; for example:
dicty = {'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}
print(dicty)
{'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}
Notice that the list uses square brackets with values separated by commas, and the dict uses curly brackets with pairs separated by commas, and colons (:) to link a key (index or address) with a value.
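(In older versions of Python, a dictionary might not print out in the order you entered it. From Python 3.7 onwards, dictionaries keep their insertion order, which is why the output above matches the order we typed.)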
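To get a single value back out, put an index (for a list) or a key (for a dictionary) in square brackets. Here is a minimal sketch with made-up example data (fruits and prices are not part of the course data). Note that list positions start counting from 0.

```python
fruits = ["apple", "banana", "cherry"]
prices = {"apple": 0.5, "banana": 0.25}

print(fruits[0])         # first element of the list
print(prices["banana"])  # value stored under the key "banana"

# Assigning to a new key adds another key/value pair to the dictionary
prices["cherry"] = 2.0
print(list(prices))      # the keys, in the order they were added
```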
Advanced: Print out 1) the third element of listy, and 2) the element of dicty relating to Giant
We’ll discuss different ways of organising data again soon, but for now we’ll look at dataframes - the way our data-friendly library Pandas works with data. We’ll be using Pandas a lot this term, so it’s good to get started with it early.
Let’s start by importing pandas. We’ll also import another library, but we’re not going to worry about that too much at the moment.
If you see a warning about ‘Building Font Cache’ don’t worry - this is normal.
import pandas
import matplotlib
%matplotlib inline
Let’s import a simple dataset and show it in pandas. We’ll use a pre-prepared “.csv” file, which we downloaded into the ./data/wk1/ folder above.
data = pandas.read_csv('./data/wk1/data.csv')
data.head()
| | Name | First Appearance | Approx height | Gender | Law Enforcement |
|---|---|---|---|---|---|
| 0 | Bob | 1.2 | 6.0 | Male | False |
| 1 | Mike | 1.2 | 5.5 | Male | False |
| 2 | Coop | 1.1 | 6.0 | Male | True |
| 3 | Maddy | 1.3 | 5.5 | Female | False |
| 4 | Giant | 2.1 | 7.5 | Male | False |
What we’ve done here is read a .csv file into a dataframe, the object pandas uses to work with data, and one that has lots of methods for slicing and dicing data, as we will see over the coming weeks. The head() command tells pandas to show the first few rows of the data (five by default), so we can start to get a sense of what the data looks like and what types of values it contains.
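Besides head(), a few other attributes are useful for a first look at a dataframe. The sketch below rebuilds the five rows shown above by hand so that it runs on its own. In the notebook you would use the data variable loaded with read_csv instead. The name example is chosen here just for illustration.

```python
import pandas

# The same five rows as displayed above, typed in by hand
example = pandas.DataFrame({
    "Name": ["Bob", "Mike", "Coop", "Maddy", "Giant"],
    "First Appearance": [1.2, 1.2, 1.1, 1.3, 2.1],
    "Approx height": [6.0, 5.5, 6.0, 5.5, 7.5],
    "Gender": ["Male", "Male", "Male", "Female", "Male"],
    "Law Enforcement": [False, False, True, False, False],
})

print(example.shape)    # (number of rows, number of columns)
print(example.columns)  # the column names
print(example.dtypes)   # the type of data stored in each column
print(example.head(2))  # head() takes an optional number of rows
```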
A common first step for exploring our data is to sort it. In Pandas, this can be done easily with the sort_values() function. We can specify which column to sort the data by, and whether we want to sort in ascending or descending order, using the arguments by and ascending, respectively. In the example below, we’re sorting in descending order of height:
data.sort_values(by='Approx height', ascending=False).head()
| | Name | First Appearance | Approx height | Gender | Law Enforcement |
|---|---|---|---|---|---|
| 4 | Giant | 2.1 | 7.5 | Male | False |
| 0 | Bob | 1.2 | 6.0 | Male | False |
| 2 | Coop | 1.1 | 6.0 | Male | True |
| 1 | Mike | 1.2 | 5.5 | Male | False |
| 3 | Maddy | 1.3 | 5.5 | Female | False |
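Two details are worth knowing here. sort_values() returns a sorted copy and leaves the original dataframe unchanged, and by can take a list of columns to break ties. Below is a minimal sketch, rebuilding two columns of the table by hand (under the illustrative name example) so that it runs on its own.

```python
import pandas

# Two columns from the table above, typed in by hand
example = pandas.DataFrame({
    "Name": ["Bob", "Mike", "Coop", "Maddy", "Giant"],
    "Approx height": [6.0, 5.5, 6.0, 5.5, 7.5],
})

# Tallest first; people of equal height are ordered alphabetically by name
by_height = example.sort_values(by=["Approx height", "Name"],
                                ascending=[False, True])

print(by_height["Name"].tolist())
print(example["Name"].tolist())  # the original order is untouched
```

To keep a sorted version around, assign the result to a variable, as by_height does here.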
If you’ve gotten this far, congratulations! To further hone your skills, try working your way through the five intro to programming notebooks on Kaggle. These cover a range of skills that we’ll be using throughout the term. Kaggle is a very useful resource for learning data science, so making an account may not be a bad idea!
The URL below contains a dataset of the most streamed songs on Spotify in 2023: https://storage.googleapis.com/qm2/wk1/spotify-2023.csv
Download it into the ./data/wk1/ directory.
QUESTION: which artist has the song with the highest number of streams?
# use this code cell to answer the question